Abstract

We propose a new graph-theoretic benchmark in this paper. The benchmark is developed to address
shortcomings of an existing widely-used graph benchmark. We thoroughly studied a large number of
traditional and contemporary graph algorithms reported in the literature to gain a clear understanding of
their algorithmic and run-time characteristics. Based on this study, we designed a suite of kernels, each
of which represents a specific class of graph algorithms. The kernels are designed to capture the typical
run-time behavior of target algorithms accurately, while limiting computational and spatial overhead
to ensure that each computation finishes in reasonable time. We expect that the developed benchmark will
serve as a much needed tool for evaluating different architectures and programming models to run graph
algorithms.

1 Introduction

Graph algorithms have become an increasingly important and widely-used tool in a wide range of emerging 
disciplines such as web mining, computational biology, social network analysis, and text analysis. Typically, 
the input to these algorithms is a graph that consists of typed vertices and typed edges that represent the
relationships between the vertices. The vertices and edges of the graphs are often associated with some
attributes.

These graphs, which belong to a class of graphs called scale-free graphs (or networks) [4], have very
complex structures. Furthermore, the graphs that are usually formed by fusing fragmental information 
obtained from many different sources like web documents and news articles tend to grow in size as more data 
becomes available, and thus graphs with billions of vertices and edges have become prevalent in practice. 
Running the graph algorithms on the large and complex graphs in an efficient and scalable way has become 
an increasingly important and yet challenging problem. This raises an imperative need to identify computer 
architectures that are best suited for solving graph theoretic problems for large complex graphs. 

There are many aspects to consider in an architectural evaluation. An ideal architecture should provide 
high performance and good scalability for target applications. It should also be energy-efficient, cost-effective, 
and easy to program. In addition, it should offer reliable, large-capacity storage subsystems in order to
process large data sets, for graph-theoretic applications in particular. These aspects must be carefully weighed
when evaluating various machine architectures, in conjunction with the run-time behavior and resource
usage patterns of the target applications. Benchmarks have been the most commonly used architectural
evaluation tool, since they are specifically designed to represent the key characteristics of the target algorithms
and hence correctly reproduce their run-time behavior. A well-designed benchmark is critical to accurate
architectural evaluation.

Although innumerable benchmarks have been developed for a wide spectrum of applications |18[ 1141 [S] , 
few attempts have been made to develop benchmarks for graph-theoretic applications. Many existing
graph-theoretic benchmarks consist of graph data sets designed mainly for the algorithmic evaluation
of specific graph algorithms [571 [3]. Recently, a new benchmark suite, called DARPA High Productivity 
Computer Systems (HPCS) Scalable Synthetic Compact Application (SSCA) Graph Analysis Benchmark 
(commonly known as the SSCA#2 benchmark), was developed [3]. The SSCA#2 benchmark suite is comprised
of a synthetic scale-free graph generator [10] and four kernels, each of which is designed based on a small set of
fundamental graph-related algorithms. The benchmark has found some success, as it is the only benchmark
currently available that is designed specifically for architectural evaluation for graph-theoretic problems. However,
the SSCA#2 benchmark has some notable drawbacks that make it inadequate for use in any
rigorous performance study. First, the benchmark is not fully representative of the wide variety of
commonly used graph algorithms and therefore fails to provide comprehensive models for their algorithmic
behavior. Some of the kernels in the SSCA#2 benchmark are not graph algorithms as such, but are closer to graph
construction and min-max finding. The remaining kernels, although they certainly address fundamental and
very important graph problems, cover only a fraction of existing graph algorithms. Second, design flaws
in some of the kernels prevent the benchmark from accurately modeling the real execution-time
characteristics of the targeted algorithms. For example, its kernel calculating the betweenness centrality scores [7]
approximates the calculation by finding shortest paths between only randomly selected vertices due to the high
computation cost. This may result in inaccurate betweenness measures and, more importantly, may
alter the real run-time memory access characteristics of the original algorithm, since the subgraph formed
by the shortest paths found may have a significantly different structure from the original graph.

We address these issues and propose a new graph-theoretic benchmark. The benchmark is comprised of 
a graph generator and a suite of kernels. The graph generator synthesizes scale-free graphs using a very fast
algorithm based on preferential attachment method |101 147) . The kernels are comprehensive and designed to 
represent a wide range of important graph algorithms, including traditional graph algorithms such as search,
combinatorial optimization, and metrics computation, as well as graph-mining methods, such as subgraph clustering
(also known as community finding) and spectral graph algorithms, that have gained popularity in recent
years. The benchmark is designed to accurately model the run-time characteristics of target algorithms with
minimal computational cost. Furthermore, its specific and detailed design allows the objective evaluation of 
the strengths and weaknesses of different machine architectures and programming models. 

The paper is organized as follows. Section 2 describes some preliminaries and graph representations used
in this paper. Section 3 presents the graph generator. The proposed benchmark is described in greater detail
in Section 4. Related work is reported in Section 5, followed by concluding remarks in Section 6.

2 Preliminaries 

2.1 Definitions 

The proposed benchmark uses an undirected, weighted, colored graph $G = (V, E)$, where $V$ is the set of vertices
and $E$ is the set of edges. Each vertex is given an integer that uniquely identifies the vertex. The vertices
are colored in gray scale, where the darkness of the vertex is controlled by a real value (e.g., 0 for white and
1 for black). That is, a vertex $v_i \in V$ is a pair $(i, c_i)$, where $0 \le i < |V|$ and $0 \le c_i \le 1$. Here, $i$ and $c_i$
represent the ID and the color of the vertex $v_i$. An edge $e = (u, v, *)$ is said to be incident to $u$ and $v$. The
degree of a vertex $v$ is defined as the number of (undirected) edges that are incident to $v$. An edge $e_i \in E$
is the tuple $(u, v, w_i)$, where $u, v \in V$ and $w_i$ is the weight of the edge such that $0 \le w_i \le 1$. The graph is
undirected iff for every $(u, v, w) \in E$, there exists $(v, u, w) \in E$. There are no self-loops in $G$. That is, for
every $(u, v, w) \in E$, $u \ne v$. Also, $G$ does not have multiple edges between any two vertices $u$ and $v$. A
graph $G' = (V', E')$ is a subgraph of $G$ when $V' \subseteq V$ and $E' \subseteq E$. A subgraph $G' = (V', E')$ is called a
clique^ if for any two distinct vertices $u, v \in V'$, $(u, v, *) \in E'$.

Figure 1: Adjacency list graph representation for graph $G = (V, E)$, where $n = |V|$.
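The constraints in these definitions can be checked mechanically. The following is a minimal sketch in Python, not part of the benchmark itself; the function name `is_valid_graph`, the edge-tuple layout, and the use of closed ranges for colors and weights are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Edge = Tuple[int, int, float]  # (u, v, w): one directed entry with weight w


def is_valid_graph(colors: List[float], edges: List[Edge]) -> bool:
    """Check the graph definitions: vertex IDs are 0..n-1, colors and weights
    lie in [0, 1] (assumed closed), no self-loops, no multiple edges, and
    every entry (u, v, w) has a mirror entry (v, u, w)."""
    n = len(colors)
    if any(not 0.0 <= c <= 1.0 for c in colors):
        return False
    weight: Dict[Tuple[int, int], float] = {}
    for u, v, w in edges:
        if not (0 <= u < n and 0 <= v < n):
            return False  # an end point is not a valid vertex ID
        if u == v or (u, v) in weight:
            return False  # self-loop or multiple edge
        if not 0.0 <= w <= 1.0:
            return False
        weight[(u, v)] = w
    # Undirected: each (u, v, w) must be matched by (v, u, w).
    return all(weight.get((v, u)) == w for (u, v), w in weight.items())
```

For example, a single undirected edge between vertices 0 and 1 is passed as the two entries `(0, 1, 0.3)` and `(1, 0, 0.3)`.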



2.2 Graph Representations 

As stated earlier, the performance of a typical graph algorithm is mainly governed by its memory access
pattern. Therefore, the graph representation has a major impact on the performance of graph algorithms.
Since there are many possible graph representations, each with different
memory behavior, it is very important to use the most appropriate graph representation schemes to
accurately model the run-time behavior of the graph algorithms.

We adopted two of the most common graph representations, adjacency list and compressed sparse row (CSR),
to store graphs in our benchmark. We chose these graph representation methods for several reasons. First,
because it is critical that our benchmark can be used to evaluate a variety of machine architectures with different
memory capacities, we have excluded representations, such as the adjacency matrix, that usually consume
a lot of memory to store a given graph. Furthermore, the graphs of interest in this research, scale-free
graphs, are sparse, and hence we mainly focused on graph representations that are ideal for sparse graphs.
Second, these are two of the most widely used methods for graph representation. Finally, these two methods
represent graph representations at the extreme ends of the spectrum of memory access patterns, where the linked
list representation exhibits more random memory accesses than the CSR representation. We believe that using
these two widely different graph representations will give users the ability to measure the best- and
worst-case memory performance of the target architecture.

The adjacency-list graph representation is shown in Figure 1. As Figure 1 shows, the colors and adjacent
vertices of the vertices are stored in two arrays, Vertex_Color and Adjacency. The color of a vertex
$v_i$ is stored in Vertex_Color[i]. The adjacent edges and vertices of $v_i$ are maintained as a linked list.
Adjacency[i] points to the first element in the linked list. Each node in the linked list contains the ID of
a vertex, u, adjacent to $v_i$ and the weight of the edge that connects $v_i$ and u. Given a vertex ID, j, the adjacent
vertices and the weights of the corresponding incident edges can be obtained by simply following the linked list
that starts at Adjacency[j].
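The adjacency-list layout described above can be sketched as follows. This is an illustrative Python sketch rather than the benchmark's reference code; the class names `EdgeNode` and `AdjacencyListGraph` and the choice to insert new nodes at the head of each list are assumptions.

```python
from dataclasses import dataclass
from typing import Iterator, List, Optional, Tuple


@dataclass
class EdgeNode:
    """One linked-list node: neighbor ID, edge weight, pointer to the next node."""
    neighbor: int
    weight: float
    next: Optional["EdgeNode"] = None


class AdjacencyListGraph:
    def __init__(self, colors: List[float]) -> None:
        self.vertex_color = list(colors)  # Vertex_Color[i] holds c_i
        # Adjacency[i] points to the first node of vertex i's list (or None).
        self.adjacency: List[Optional[EdgeNode]] = [None] * len(colors)

    def add_undirected_edge(self, u: int, v: int, w: float) -> None:
        # An undirected edge is stored as two directed entries.
        self.adjacency[u] = EdgeNode(v, w, self.adjacency[u])
        self.adjacency[v] = EdgeNode(u, w, self.adjacency[v])

    def neighbors(self, j: int) -> Iterator[Tuple[int, float]]:
        node = self.adjacency[j]
        while node is not None:  # follow the linked list node by node
            yield node.neighbor, node.weight
            node = node.next
```

Each call to `neighbors` follows one pointer per incident edge, which is the pointer-chasing access pattern this representation is meant to exercise.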

This graph representation is not very efficient in terms of storage space, since additional space is
required to store a pointer for each edge. Further, a graph algorithm that processes a graph in this
representation is likely to exhibit random memory access behavior and hence suffer poor performance, since
accessing adjacent vertices requires chasing pointers to objects that are very likely dispersed across the heap.
It should be noted, however, that it is never our intention to design the most efficient graph benchmark.
Rather, our goal is to provide a good architectural evaluation tool to users.

Figure 2: Compressed sparse row (CSR) graph representation for graph $G = (V, E)$, where $n = |V|$. The
figure shows the Vertex_Color, Adjacency, Edge_List, and Edge_Weights arrays.

The CSR graph representation is one of the most popular graph representation techniques and is especially well
suited to representing sparse graphs such as scale-free graphs. Its popularity can be attributed to the fact that
it requires minimal storage space to store graphs while reducing random memory accesses. The CSR graph
representation is depicted in Figure 2. In the CSR graph representation, all the edges and their associated
weights are stored in two arrays, Edge_List and Edge_Weights. Therefore, the size of each of these arrays is equal to
the total number of edges in the graph (i.e., $|E|$). The adjacency information is maintained through another
array, Adjacency. Unlike the adjacency list representation, where each element in the Adjacency array
points to the list of edges that are incident to the corresponding vertex, in the CSR representation an entry in
the Adjacency array holds the starting position in the Edge_List array of the edges that are incident
to the corresponding vertex. In Figure 2, for example, the adjacent vertices of $v_i$ are stored in Edge_List[p]
... Edge_List[q - 1], the adjacent vertices of $v_{i+1}$ are stored in Edge_List[q] ... Edge_List[l - 1], and
so on. Also, the degree of the vertex $v_i$ is $q - p$. The vertex colors are stored in Vertex_Color in the
same way as in the adjacency list representation.

Figure 3 presents a sample graph and its corresponding CSR graph representation^. The vertex colors and
edge weights are depicted in blue and red in Figure 3(a), respectively. A directed graph is used in the example
for simplicity of presentation. In an undirected graph, each (undirected) edge is translated into two edges
in opposite directions and represented in the same way as in the directed graph case. In this example, vertices
0 and 5 do not have any outgoing edges. This is indicated by the fact that the entries in the Adjacency array
for these vertices have the same value as the ones immediately following them. It should also be noted that
the Adjacency array keeps one additional entry at its end to indicate the array boundary for the
last vertex (vertex 7 in the example).
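The CSR layout, including the extra boundary entry and the equal consecutive entries for vertices without outgoing edges, can be sketched as below. This is a hedged illustration in Python; the function names `build_csr` and `neighbors` and the counting-based construction are assumptions, and the small graph used in the usage note is not the graph of Figure 3.

```python
from typing import List, Tuple

Edge = Tuple[int, int, float]  # directed entry (source, destination, weight)


def build_csr(n: int, edges: List[Edge]) -> Tuple[List[int], List[int], List[float]]:
    """Build the Adjacency, Edge_List, and Edge_Weights arrays for n vertices."""
    degree = [0] * n
    for u, _, _ in edges:
        degree[u] += 1
    # Adjacency has n + 1 entries; the last one marks the end of the last vertex.
    adjacency = [0] * (n + 1)
    for i in range(n):
        adjacency[i + 1] = adjacency[i] + degree[i]
    edge_list = [0] * len(edges)
    edge_weights = [0.0] * len(edges)
    cursor = adjacency[:n]  # copy: next free slot for each source vertex
    for u, v, w in edges:
        edge_list[cursor[u]] = v
        edge_weights[cursor[u]] = w
        cursor[u] += 1
    return adjacency, edge_list, edge_weights


def neighbors(adjacency: List[int], edge_list: List[int],
              edge_weights: List[float], i: int) -> List[Tuple[int, float]]:
    p, q = adjacency[i], adjacency[i + 1]  # degree of vertex i is q - p
    return list(zip(edge_list[p:q], edge_weights[p:q]))
```

With `build_csr(4, [(1, 2, 0.5), (1, 0, 0.25), (2, 1, 0.75)])`, the Adjacency array is `[0, 0, 2, 3, 3]`: vertices 0 and 3 have no outgoing edges, so their entries equal the entries that follow them.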



3 Graph Generator 

Recent studies have revealed that many important real-world graphs belong to a special class of graphs called
scale-free graphs. Some of these scale-free graphs include World-Wide Web [8], Internet [211 [211 EZl l44] . 
electric power grids [46], citation networks [29] [38] [42] [43], telephone call graphs [2], and e-mail networks [20].



^The adjacency list representation is not shown here, because it is relatively straightforward to represent the graph in this
data structure. Interested readers should refer to [13].

(a) Sample directed graph

Figure 3: A sample graph and its corresponding CSR representation.



It is very important to understand the behaviors of different graph algorithms on the key real-world graphs, 
and hence, the proposed benchmark includes a graph generator that constructs synthetic graphs that possess 
key characteristics of real-world scale-free graphs [47].

The graph generator is based on the well-known preferential attachment graph model [4] and offers significant
advantages over the R-MAT based graph generator [10] of the SSCA#2 benchmark. Our graph generator has a
low computational complexity of $O(m)$, compared to $O(m \cdot \log m \cdot \log n)$ for the graph generator in the SSCA#2
benchmark, where n and m represent the number of vertices and edges of a graph, respectively. Furthermore,
our graph generator was shown to generate more realistic scale-free graphs than the SSCA#2 graph generator and
can be easily parallelized [47]. The preferential attachment method [19], which our graph
generator is based on, follows the well-known generative Barabasi-Albert (BA) model [4]. In the BA model, a
scale-free graph is constructed by repeatedly creating a new vertex and then attaching it to one of the existing
vertices. Here, the existing vertex is selected with a probability that is proportional to its
current degree^. The graph generation algorithm is described in detail in Algorithm 1.

The algorithm begins by constructing a clique of a given size. Vertices in the initial clique will form
a set of high-degree vertices called hubs. Selecting an existing vertex based on its current degree can incur
very high computation overhead, especially for larger graphs, if the entire vertex set needs to be scanned for
each newly added vertex. To speed up this selection process, our graph generator maintains a list that contains
the end points of all the edges in the partially constructed graph (steps 5 and 15 in Algorithm 1). The basic
idea here is to select a vertex from the list of end points. Since the number of occurrences of each vertex
in the list is equal to its degree, the probability that a vertex is selected is proportional to its degree. This
enables the creation of each edge in constant time and reduces the overhead considerably.

The degree of a newly added vertex follows a uniform distribution with the given average vertex degree as its
mean (step 10). This is to ensure that the generated graph does not have a uniform structure. Each edge in
the graph is associated with a weight between 0 and 1, and similarly, each vertex in the graph has a color,
which is also a real value between 0 and 1 (steps 9 and 14). These values are used by some of the kernels in
the benchmark, as explained later.
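The end-point list technique can be illustrated with a short sequential sketch. It is written in Python under several assumptions that are not taken from Algorithm 1: the function name `generate_scale_free`, the integer range [1, 2d - 1] used to obtain a uniform degree with mean d, the rejection of duplicate targets to avoid multiple edges, and the fixed random seed.

```python
import random
from typing import List, Tuple

Edge = Tuple[int, int, float]  # (new vertex, existing vertex, weight)


def generate_scale_free(n: int, clique_size: int, avg_degree: int,
                        seed: int = 0) -> Tuple[List[float], List[Edge]]:
    """Preferential attachment using a list of edge end points.
    Requires clique_size >= 2 and avg_degree >= 1."""
    rng = random.Random(seed)
    colors = [rng.random() for _ in range(n)]  # vertex colors in [0, 1)
    edges: List[Edge] = []
    endpoints: List[int] = []  # each vertex appears once per incident edge
    # Initial clique: its vertices become the hubs.
    for u in range(clique_size):
        for v in range(u + 1, clique_size):
            edges.append((u, v, rng.random()))
            endpoints += [u, v]
    for new in range(clique_size, n):
        # Assumed degree distribution: uniform integers with mean avg_degree,
        # capped by the number of existing vertices.
        k = min(rng.randint(1, 2 * avg_degree - 1), new)
        targets = set()
        while len(targets) < k:
            # A uniform draw from the end-point list picks a vertex with
            # probability proportional to its current degree, in O(1).
            targets.add(rng.choice(endpoints))
        for t in sorted(targets):
            edges.append((new, t, rng.random()))
            endpoints += [new, t]
    return colors, edges
```

Because the end points of a new vertex are appended only after its targets are chosen, the new vertex cannot select itself, so no self-loops arise.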

Parallelizing the graph generator is also very simple. In the parallel graph generator, each processing
element (PE) (e.g., a process or a thread) maintains its own list of end points in the same manner as in
the sequential generator. The source vertices of the edges in a list are owned by the processing element
that maintains the list. However, the destination vertices may belong to other processing elements. When
the generation of local edges is complete, the final graph is synthesized by merging all the edges with the same
destination end points. Since processing elements create local edges independently of each other, this
algorithm is embarrassingly parallel. Interested readers should refer to [47] for a more detailed description of the
parallel graph generator.

^The BA model is also called the rich-get-richer model, since in this model the rich vertices (vertices with high degree) have
a higher probability of being connected to newly created vertices and hence become richer.

4 Description of Graph Benchmark 

We designed the proposed benchmark with a set of clear design goals. First, the benchmark should be
comprehensive in that it should cover a set of fundamental and commonly used graph algorithms so
that it can present the variety of resource usage and execution patterns that typical graph algorithms
exhibit during their execution. Second, the benchmark should be designed in such a way that it captures
and models only the essential algorithmic characteristics of the target algorithms. This requires a thorough
understanding of the target algorithms. Third, the benchmark should be very specific about algorithms, graph
representations, and all data structures used by the algorithms so that one can make an unbiased evaluation
of the target architecture. Finally, the benchmark should be computationally tractable so that runs can finish
in a reasonable amount of time.

As the first step in designing the proposed benchmark, we surveyed the graph algorithms that are
commonly used in various important applications in practice. The survey was performed to make our benchmark
as comprehensive as possible by ensuring that the new benchmark covers an extensive list of key graph
algorithms. Furthermore, this study helped us to better understand and correctly capture
the run-time behavior of the target graph algorithms. The surveyed graph algorithms are classified into 5
groups^. The algorithms in each category are modeled, mainly focusing on their algorithmic and run-time
characteristics, and translated into a kernel. The kernels of the proposed benchmark are described in detail
below.

4.1 Kernel 1: Graph Search 

4.1.1 Graph search algorithms 

Algorithms in this class pertain to traversing paths in a given graph and finding the information associated
with the paths and (typically) the vertices on the paths. These algorithms are fundamental graph algorithms that
have been applied to solving a variety of scientific and engineering problems. In this research, we surveyed
many classical graph algorithms that perform a form of graph search, including breadth-first search (BFS),
depth-first search (DFS), Dijkstra's algorithm, Prim's algorithm, Kruskal's algorithm, and the Bellman-Ford
algorithm.

We will not discuss these algorithms in further detail in this paper, as detailed descriptions of these
algorithms are widely available in the literature [13], but we should point out the key characteristics of these
algorithms that govern their run-time behavior. Although they deal with the vertices and edges in a given
graph, they primarily operate on sets obtained from the graph. The BFS algorithm, for example, accesses
two sets, which maintain the vertices at the current level and all the visited vertices, when it expands the
search to the next level. Similarly, other algorithms in this class also access elements in sets in one way or
another. Here, the way the elements in the sets are maintained and accessed dictates the run-time memory
access patterns of these algorithms. If a set is unordered (as in the case of BFS), then most likely, accesses
to the elements in the set can be done sequentially, while accesses to ordered sets involve more random
memory accesses. In addition to the set operations, accesses to the adjacency list, which are usually in random
order, also affect their run-time memory behavior.

It has been generally believed that graph algorithms, especially the graph search algorithms, have poor
cache utilization due to their random memory accesses. The use of efficient data structures to maintain the
sets (e.g., priority queues) and to represent graphs (e.g., the compressed sparse row (CSR) graph representation),
however, may allow the search algorithms to access memory in a more sequential manner, in contrast to
the popular belief. We aim to capture this behavior correctly in this kernel.

4.1.2 Benchmark description 

Algorithm 2 describes Kernel 1 in detail. The design of the kernel is essentially based on the single-source
shortest paths problem [13]. Here, the length of a path is the sum of the weights of the edges that comprise the
path. The input to the kernel is the graph G(V, E) to be searched and the source vertex s from which the search begins.
The kernel returns an array of size |V|, where the i-th entry holds the length of the path from s to $v_i$. There can be
many variations of algorithms to find all the shortest paths from a single source, but the proposed kernel
combines the behavior of the BFS algorithm and ordered set accesses.

A queue of vertices, denoted as Q, is used to maintain the set of unvisited vertices. Since the algorithm
needs to find a vertex with the smallest current path length from s (line 10 in Algorithm 2), the elements in
Q should be maintained in increasing order of their current path lengths. That is, Q is an ordered set.

When a vertex v is selected from Q, its path length is the minimum path length from s to v and will
not be improved by any other path. Hence, the vertex v is removed from Q. Then, all of its neighboring
vertices are added to Q, if they are not already in Q, followed by relaxing their path lengths. The relaxation
of a vertex updates the current length of the path to the vertex if there exists a path of shorter length
to the vertex [13]. The relaxation is performed in lines 26 to 30 in Algorithm 2.
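The select, remove, and relax cycle described above can be sketched over CSR arrays as follows. This Python sketch is not Algorithm 2: it assumes a binary heap (`heapq`) as the ordered set and pushes a fresh entry on each successful relaxation, skipping stale entries later, instead of checking whether a vertex is already in Q. The function name `shortest_path_lengths` is also an assumption.

```python
import heapq
import math
from typing import List


def shortest_path_lengths(adjacency: List[int], edge_list: List[int],
                          edge_weights: List[float], s: int) -> List[float]:
    """Return an array whose i-th entry is the shortest path length from s to i."""
    n = len(adjacency) - 1
    dist = [math.inf] * n
    dist[s] = 0.0
    done = [False] * n
    queue = [(0.0, s)]  # ordered by current path length
    while queue:
        d, v = heapq.heappop(queue)  # vertex with the smallest current length
        if done[v]:
            continue  # stale entry left behind by an earlier relaxation
        done[v] = True  # the length of v is final from here on
        for k in range(adjacency[v], adjacency[v + 1]):
            u = edge_list[k]
            candidate = d + edge_weights[k]
            if candidate < dist[u]:  # relaxation
                dist[u] = candidate
                heapq.heappush(queue, (candidate, u))
    return dist
```

Unreachable vertices keep the value `math.inf` in the returned array. The heap accesses model the ordered-set behavior, while the inner loop scans a contiguous slice of Edge_List.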

4.2 Kernel 2: Spectral Graph Analysis 

4.2.1 Spectral graph problem 

The graph algorithms in this class of algorithms are related to the spectral graph theory [15l [161 HZj- 
Essentially, the spectral graph theory is the study of properties of a graph in relationship to the eigenvalues 
and eigenvectors of its corresponding adjacency matrix. In other words, the algorithms in this class treat 
a graph as a matrix and aim at finding the eigenvalues/eigenvectors of the matrix. Prominent examples of
algorithms in this class include PageRank [36] and random walk with restart (RWR) [40]. These algorithms
compute eigenvectors, in which each value is typically interpreted as the relative importance or relevance of
the corresponding vertex to other vertices in the graph.

The eigenvalue/eigenvector problem has been studied quite extensively and a number of eigensolvers are 
available [HI [5S] . Also, it is well known that the algorithms in this class generally do not scale well and their 
performance and scalability are greatly affected by graph representation method. We attempt to model the 
essence of the run-time characteristics of these algorithms in this kernel. 

4.2.2 Benchmark description 

The proposed kernel relies on the simple power method [26] to find an eigenvalue for a given graph. The power
method finds the largest (dominant) eigenvalue and its corresponding eigenvector. Though simple, the power
method works well for well-conditioned sparse graphs such as the graphs targeted by the proposed
benchmark. This decision is justified by our design objective to model the behavior of the algorithms in the
domain of spectral graph theory while guaranteeing the completion of the computation in a reasonable
time.

The inputs to the kernel are the graph G(V, E) and threshold values that are used to limit the number
of iterations. The kernel produces an eigenvector for the largest eigenvalue as output. The eigenvector is
initialized with the same constant in every entry, but any random numbers can be used. The function Multiplication in
lines 22 to 24 in Algorithm 3 performs a matrix-vector multiplication. Convergence is determined
by a ratio of vector norms, where ||X|| denotes the L1 norm of X. If the ratio is smaller than the given
threshold, ε, then the iteration stops. Because the power method can converge very slowly, we limit the
number of iterations by another input parameter, m, which represents the maximum number of iterations
allowed.

It should be noted that an implementation of this kernel must not convert the given graph representation into
an equivalent adjacency matrix representation. The reason for this restriction is that although the
adjacency matrix graph representation would facilitate certain operations in the kernel (i.e., lines 16 to 17 in
Algorithm 3), accessing the matrix has memory access patterns that are very different from
accesses to the other graph representations.
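A power iteration that works directly on the CSR arrays, without building an adjacency matrix, might look like the sketch below. It is a Python illustration and not Algorithm 3. The stopping ratio used here, the L1 norm of the change divided by the L1 norm of the previous vector, is an assumption, as are the function name `power_method` and the parameter names `eps` and `max_iter` (standing in for the threshold and the iteration limit m).

```python
from typing import List, Tuple


def power_method(adjacency: List[int], edge_list: List[int],
                 edge_weights: List[float], eps: float = 1e-9,
                 max_iter: int = 1000) -> Tuple[float, List[float]]:
    """Estimate the dominant eigenvalue and eigenvector of the weighted graph."""
    n = len(adjacency) - 1
    x = [1.0 / n] * n  # same constant in every entry; L1 norm is 1
    eigenvalue = 0.0
    for _ in range(max_iter):
        # Matrix-vector product y = A x computed from the CSR arrays.
        y = [0.0] * n
        for i in range(n):
            acc = 0.0
            for k in range(adjacency[i], adjacency[i + 1]):
                acc += edge_weights[k] * x[edge_list[k]]
            y[i] = acc
        norm = sum(abs(t) for t in y)  # L1 norm of A x
        if norm == 0.0:
            break  # graph without edges: nothing to iterate on
        y = [t / norm for t in y]
        # Assumed convergence ratio; the L1 norm of x is 1, so no division is needed.
        change = sum(abs(a - b) for a, b in zip(y, x))
        x, eigenvalue = y, norm
        if change < eps:
            break
    return eigenvalue, x
```

Since the weights are non-negative and x is normalized to L1 norm 1, the L1 norm of A x approaches the dominant eigenvalue as x converges. On bipartite graphs the iteration can oscillate, in which case the iteration limit ends the loop.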

4.3 Kernel 3: Vertex and Edge Accesses 

4.3.1 Adjacency finding 

Without doubt, the most frequently used graph operation is finding the set of vertices adjacent to a given
vertex. In fact, most graph algorithms need to access adjacency data. The graph search algorithms,
for example, must access the vertices adjacent to a given vertex to traverse paths. Those algorithms that
explore the graph structure, such as maximal clique finding and subgraph pattern matching, also need the
vertex adjacency. These algorithms typically access the data structures that store the adjacency information
repeatedly to find neighboring vertices. Therefore, accesses to the adjacency information dominate the
run-time memory access patterns of many graph algorithms, especially the graph algorithms that operate
primarily on the adjacency data. The repeated adjacency finding operations usually result in a memory
access behavior that is characterized by a combination of random and sequential memory accesses.

A graph transformation algorithm called graph hierarchicalization [1] is of particular interest because this
algorithm repeatedly accesses the adjacency information, and hence, our kernel for the vertex and edge
accesses is modeled on this algorithm. Graph hierarchicalization transforms a graph into
a hierarchy of graphs, where the graph at each level provides a view of the graph at a different granularity.
More specifically, the graph hierarchicalization algorithm gradually converts the given graph into a graph
of smaller size by iteratively coalescing vertices and edges of the given graph. Mapping information is also
created to map the vertices and edges in two adjacent graphs in the hierarchy, although this is not included
in this kernel. Since graph hierarchicalization can abstract and reduce the size of a graph, it is mainly
used for graph summarization and visualization.

Another aspect we try to capture with this kernel is the behavior of algorithms that require frequent
updates to the given graph. We incorporated the updates in this kernel by replacing the coalesced vertices with a
(mega) vertex that represents them and reconnecting the appropriate edges.

4.3.2 Benchmark description 

The input to the kernel is the graph G(V, E) and a constant γ, which is used to control the number of coalescences
performed. Specifically, γ is used to compute the number of vertices from which the coalescences start (lines 3 and 7).

In each coalescence, a vertex is selected from the graph at random. Then, the vertex is coalesced with
its adjacent vertices. The coalesced vertices are kept in S. Also, the average color of the vertices in S
is computed for the new vertex that represents S. Lines 13 to 20 rearrange the edges that are
connected to the vertices in S in such a way that they are connected to the new vertex for S. Once
the vertices have been coalesced and the edges reconnected, all the vertices in S are replaced by the new vertex
representing S (lines 21 to 24). Since this kernel alters the structure of the original graph, it is necessary to keep
track of the changes made so that the original graph can be restored (line 27). It is permissible to save the entire
graph to disk if bookkeeping of the changes consumes too many resources in terms of time and space.
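A single coalescence step can be sketched as follows. This Python sketch uses a dictionary-of-dictionaries adjacency structure for brevity, which is not one of the two representations prescribed by the benchmark. The function name `coalesce`, the caller-supplied `new_id`, and the rule of keeping the smallest weight when several edges collapse into one are all assumptions; the text above does not state how such weights are combined.

```python
from typing import Dict, Set


def coalesce(adj: Dict[int, Dict[int, float]], color: Dict[int, float],
             v: int, new_id: int) -> Set[int]:
    """Merge v and its neighbors into one new vertex; return the merged set S.
    adj maps each vertex to {neighbor: weight} and is symmetric."""
    group = {v} | set(adj[v])  # S: the selected vertex and its adjacent vertices
    color[new_id] = sum(color[u] for u in group) / len(group)  # average color
    merged: Dict[int, float] = {}
    for u in group:
        for nbr, w in adj[u].items():
            if nbr not in group:
                # Assumed rule: keep the smallest weight among collapsed edges.
                merged[nbr] = min(w, merged.get(nbr, w))
    # Detach the vertices in S from the rest of the graph and remove them.
    for u in group:
        for nbr in adj[u]:
            if nbr not in group:
                del adj[nbr][u]
        del adj[u]
        del color[u]
    # Reconnect the surviving edges to the new vertex.
    adj[new_id] = merged
    for nbr, w in merged.items():
        adj[nbr][new_id] = w
    return group
```

A caller that needs to restore the original graph afterwards could keep a deep copy of `adj` and `color` before the first coalescence, which corresponds to the simplest form of the bookkeeping mentioned above.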

4.4 Kernel 4: Graph Metric 

4.4.1 Graph metric computation 

Various metrics are measured and utilized in the area of graph mining and graph structure analysis. Some
common graph metrics include degree distribution, clustering coefficients, and modularity. The clustering
coefficient of a vertex, proposed by Watts and Strogatz [45], is a strong indicator of how densely the vertex
is connected with its neighboring vertices and can be used to identify the centroids of potential clusters [11].
Similarly, the modularity [35] is a metric that directly indicates the density of a subgraph and is widely used
in many community finding algorithms [31] [351 [331 [331 [311 [HI [H] • The betweenness centrality [7] of a vertex, 
on the other hand, is a good indication of the likelihood of the vertex belonging to a cluster.

Computation of these graph metrics generally shows random memory access patterns, since it is closely
related to and heavily relies on adjacency finding. We selected one of the most popular and simplest graph
metrics, the clustering coefficient, as the base model in designing this kernel.

4.4.2 Benchmark description 

This kernel takes a graph G(V, E) and a constant m as its input. The constant m is used to control the
number of vertices for which the kernel finds clustering coefficients. The computation of the clustering
coefficient (lines 6 to 16 in Algorithm [S]) is straightforward, and we omit its description. It should be
noted, however, that the kernel counts each edge as a directed edge in its calculation. This is because each
undirected edge is represented as two directed edges in opposite directions in the graph representations used
in our benchmark.
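The effect of counting directed edges can be seen in a small sketch of the per-vertex computation on CSR arrays. This is a Python illustration under assumptions, not the kernel's algorithm: the function name `clustering_coefficient`, the denominator k(k - 1) for a vertex with k neighbors, and the value 0 for vertices with fewer than two neighbors are choices made here.

```python
from typing import List


def clustering_coefficient(adjacency: List[int], edge_list: List[int], v: int) -> float:
    """Clustering coefficient of v, counting each stored (directed) edge once."""
    nbrs = edge_list[adjacency[v]:adjacency[v + 1]]
    k = len(nbrs)
    if k < 2:
        return 0.0  # assumed convention for vertices with fewer than two neighbors
    nbr_set = set(nbrs)
    links = 0
    for u in nbrs:
        for j in range(adjacency[u], adjacency[u + 1]):
            if edge_list[j] in nbr_set:
                links += 1  # directed edge from one neighbor of v to another
    # k * (k - 1) is the number of possible directed edges among k neighbors.
    return links / (k * (k - 1))
```

Because every undirected edge appears twice in Edge_List, the directed count is twice the undirected count, and dividing by k(k - 1) instead of k(k - 1)/2 gives the same value as the usual undirected definition.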

4.5 Kernel 5: Global Optimization 

4.5.1 Global optimization problem for graphs 

There is a class of graph algorithms that find solutions by optimizing certain objective functions. Many
algorithms that solve NP-complete combinatorial optimization problems on graphs belong to this class,
although they are not of interest in this research, mainly due to their high computational complexity. We
are mainly interested in those graph algorithms that follow the greedy method [13]. These algorithms compute
solutions by iteratively constructing a sequence of (partial) solutions. At each iteration, a new solution
is constructed in a way that improves the given objective function. A well-known graph partitioning algorithm,
METIS [28], for instance, partitions a given graph into a fixed number of partitions in such a way as to minimize
the cuts between partitions to reduce communication on message-passing parallel computers. A classic
community finding algorithm by Newman and Girvan constructs a dendrogram of vertices that results in
the largest modularity value [35].

The objective functions employed by these algorithms are usually global measures whose computation requires 
accessing and processing the entire graph. These global accesses make it very difficult to parallelize this class of 
algorithms. The algorithms in this class are modeled in our benchmark to capture this algorithmic behavior. 
Our benchmark models a well-known community finding algorithm called AUTOPART. This algorithm 
iteratively divides a given graph into a set of communities in such a way that the value of an entropy-based 
objective function decreases. The AUTOPART algorithm is an ideal representative of this class of 
algorithms to model, because it incurs relatively low computational overhead while exhibiting the run-time 
behavior common to the algorithms in this class. 



4.5.2 Benchmark description 

The input to the kernel is a graph G(V, E) and two constants, α and m. The constant α is mainly used to reduce the 
search space by filtering out the vertices whose color is less than α (line 3 in Algorithm 6). The constant m is 
used to control the number of iterations to ensure that the computation finishes within a reasonable amount of 
time. 

The kernel partitions the vertices of G into a set of clusters. The vertices are grouped together 
in such a way that the grouping improves the overall density of the groups. The density of a group is defined 
as the ratio of the number of internal edges to that of external edges. Here, an internal edge is an edge 
that connects two vertices in the given group of vertices. An external edge, on the other hand, is an edge exactly one 
of whose endpoints belongs to the group. 
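
As an illustration of this definition, the following Python sketch counts internal and external edges for one group. The edge-list representation (triples of source, target, weight), the function name, and the handling of a group with no external edges are assumptions made here; the text does not specify them.

```python
def density(edges, group):
    """Ratio of internal to external edges for a group of vertices.

    edges is an iterable of (u, v, weight) triples; group is a set of vertices.
    """
    internal = 0
    external = 0
    for u, v, _weight in edges:
        u_in = u in group
        v_in = v in group
        if u_in and v_in:
            internal += 1  # both endpoints inside the group
        elif u_in or v_in:
            external += 1  # exactly one endpoint inside the group
    if external == 0:
        return float("inf")  # assumption: no external edges means unbounded density
    return internal / external


# Triangle 0-1-2 plus the edge 2-3, each undirected edge stored in both directions.
undirected = [(0, 1), (1, 2), (0, 2), (2, 3)]
edges = [(u, v, 1.0) for u, v in undirected] + [(v, u, 1.0) for u, v in undirected]
print(density(edges, {0, 1, 2}))
```

In this example the group {0, 1, 2} has six internal directed edges and two external ones, so its density is 3.0.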

The kernel maintains a set, C, of groups of vertices. It first selects a group g ∈ C that has the minimum 
density value. Then, it tries to split g into two by iteratively checking whether moving a vertex from g to the new 
group will improve the overall density (lines 12 to 24). The objective function used by the kernel is defined 
by a function called Objective in Algorithm 6. The objective function sums up the value of Shannon's 
binary entropy function [39] for each group in C. The kernel continues splitting the groups in C until it cannot split 
any of them (lines 18 to 20) or the number of iterations reaches the given m. 
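
The entropy-based objective can be sketched in Python as follows. This is only an illustration: it assumes that the value passed to the entropy function lies in [0, 1] and adopts the usual convention that 0 · log2(0) = 0, neither of which is stated in the text. The function names are hypothetical.

```python
import math


def binary_entropy(p):
    """Shannon's binary entropy H(p) = -p*log2(p) - (1-p)*log2(1-p)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("binary entropy is only defined for 0 <= p <= 1")
    if p == 0.0 or p == 1.0:
        return 0.0  # convention: 0 * log2(0) = 0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def objective(group_values):
    """Sum of the binary entropy over the per-group values."""
    return sum(binary_entropy(p) for p in group_values)


print(objective([0.5, 0.5, 1.0]))
```

The entropy is largest (1 bit) at p = 0.5 and vanishes at p = 0 and p = 1, so a split lowers the objective when it pushes the per-group values toward either extreme.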

5 Related Work 

Although there are numerous graph algorithms reported in the literature, little work has been done in 
graph-centric benchmark development. The most relevant work in this area is probably the recently developed 
benchmark suite called the DARPA High Productivity Computer Systems (HPCS) Scalable Synthetic Compact 
Application (SSCA) Graph Analysis Benchmark, which is commonly known as the SSCA#2 benchmark [3]. 
The SSCA#2 benchmark suite comprises a synthetic scale-free graph generator based on the RMAT 
method and four kernels. Each of the kernels is designed based on a small set of fundamental 
graph-related algorithms. It is the only benchmark currently available for architectural evaluation for 
graph-theoretic problems. However, the SSCA#2 benchmark suffers from significant shortcomings. One of the biggest 
drawbacks is that the benchmark is not comprehensive. Also, among the four kernels in the benchmark, 
kernels 1 and 2 are not directly graph-related. The remaining kernels suffer from design flaws that prevent the 
benchmark from accurately modeling the real execution-time characteristics of the targeted algorithms. In 
addition, its use of RMAT as a synthetic graph generator results in longer graph generation time and in graphs 
that do not resemble common real-world scale-free graphs. 

Gokhale et al. [23] have evaluated the use of existing and emerging computing architectures, storage 
technologies, and programming models to solve data-intensive problems in various scientific and engineering 
disciplines. One of the data-intensive applications considered in their research was graph algorithms. In 
particular, they were interested in the graph algorithms that access very large graphs stored in external 
storage devices. In an effort to investigate the use of the new storage devices and identify what is needed 
for the efficient and scalable computation of graph applications, they developed a graph-centric benchmark 
that operates on out-of-core graphs. This benchmark measures the performance and scalability of the graph 
ingestion and search. 

A library that contains a set of the most commonly used graph algorithms has been developed [30]. The 
library, called the Boost Graph Library (BGL), enables the reuse of graph algorithms and data structures by 
providing users with a generic interface that allows access to a graph's structure while hiding the details 
of the implementation. Therefore, any graph algorithm or library that implements this interface can 
use the BGL generic algorithms and other algorithms that use the interface. The BGL offers 
some of the fundamental graph algorithms, including breadth- and depth-first search, Dijkstra's 
shortest path algorithm, Kruskal's and Prim's minimum spanning tree algorithms, and a strongly 
connected components algorithm. In addition to the graph algorithms, the BGL supports common graph 
representations: adjacency_list and adjacency_matrix. The BGL itself can be used as a benchmark, since it 
supports a wide range of fundamental graph algorithms and common data structures to represent graphs. 
However, it lacks a means of finely controlling the behavior of these graph algorithms, which any real benchmark 
is expected to provide. 

6 Conclusions 

Graphs have become increasingly important and widely used tools in a wide range of emerging disciplines 
such as web mining, computational biology, social network analysis, and text analysis. These graphs are 
usually very large and have very complex structures. Running graph algorithms on these large and 
complex graphs in an efficient and scalable way has become an increasingly important yet challenging 
problem, which raises an imperative need to identify the computer architectures best suited for running graph 
algorithms. There exists, however, no good benchmark that is designed specifically for graph algorithms. 

We develop and present a new graph-theoretic benchmark in this paper to address this issue. The 
benchmark comprises a very efficient graph generator and six kernels. We thoroughly studied a large number 
of traditional and contemporary graph algorithms reported in the literature to gain a clear understanding 
of their algorithmic and run-time characteristics. The developed kernels are comprehensive and correctly 
model the common behavior of a wide range of important graph algorithms. We expect that the developed 
benchmark will serve as a much-needed tool for evaluating different architectures and programming models 
for running graph algorithms. 

Algorithm 1 Graph Generation Algorithm 
// Create a graph G(V, E), where V and E are sets of vertices and edges of G 
// L is list of edges 
Create a clique C of size c 
for all edges (u, v) ∈ C do 
    L ← L ∪ {u, v} 
end for 
for i ← c to |V| - 1, where |V| denotes the number of vertices in G do 
    id[u] ← i 
    color[u] ← uniform(0, 1) 
    d ← uniform(1, 2·D), where D is average vertex degree of G 
    for j ← 0 to d - 1 do 
        k ← uniform(0, |L|) 
        v ← L_k 
        weight[(u, v)] ← uniform(0, 1) 
        L ← L ∪ {u, v} 
    end for 
end for 
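
A minimal Python sketch of Algorithm 1 is given below for illustration. It treats L as a flat list of edge endpoints, so that picking a random entry favors vertices that already have many edges. The function and parameter names, the colors and weights assigned to the initial clique, and the use of a seeded generator are assumptions of this sketch and are not specified by the algorithm.

```python
import random


def generate_graph(n, c, avg_degree, seed=0):
    """Generate a graph with n vertices, starting from a clique of size c.

    Returns (color, edges), where color maps vertex -> value in [0, 1) and
    edges is a list of (u, v, weight) triples.
    """
    if c < 2 or n < c:
        raise ValueError("need 2 <= c <= n")
    rng = random.Random(seed)
    color = {u: rng.random() for u in range(c)}  # assumption: clique vertices get colors too
    edges = []
    endpoints = []  # the list L: every edge contributes both of its endpoints
    for u in range(c):
        for v in range(u + 1, c):
            edges.append((u, v, rng.random()))  # assumption: clique edges get weights too
            endpoints.extend((u, v))
    for u in range(c, n):
        color[u] = rng.random()
        d = rng.randint(1, 2 * avg_degree)  # number of edges for the new vertex
        for _ in range(d):
            # As in Algorithm 1 as written, self-loops and parallel edges are not filtered.
            v = endpoints[rng.randrange(len(endpoints))]
            edges.append((u, v, rng.random()))
            endpoints.extend((u, v))
    return color, edges


color, edges = generate_graph(n=50, c=4, avg_degree=3, seed=1)
print(len(color), len(edges))
```

Sampling from the endpoint list rather than from the vertex set is what makes high-degree vertices more likely to receive new edges.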



Algorithm 2 Kernel 1: Graph Search 
Input: G(V, E) and a vertex s 
Output: Array W, where W[u] is the weight of path from s to u 
Create an array W of size |V| 
for all u ∈ V do 
    W[u] ← ∞ 
end for 
W[s] ← 0 
Q ← {s} 
while Q ≠ ∅ do 
    u ← Extract_Min(Q, W) 
    for all v ∈ Adjacent(u) do 
        if v ∉ Q then 
            Q ← Q ∪ {v} 
        end if 
        w ← weight of e(u, v) 
        Relax(W, u, v, w) 
    end for 
end while 

function Extract_Min(Q, W) 
    v ← v' ∈ Q such that W[v'] ≤ W[u] for ∀u ∈ Q 
    return v 
end function 

function Relax(W, u, v, w) 
    if W[v] > W[u] + Weight(u, v) then 
        W[v] ← W[u] + Weight(u, v) 
    end if 
end function 

function Adjacent(u) 
    return {v | (u, v, *) ∈ E} 
end function 
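
The relaxation-based search of Algorithm 2 can be sketched in Python as shown below. The sketch deviates from the listing in two labeled ways: it uses a binary heap (Python's heapq) in place of the linear-scan Extract_Min, and it enqueues a vertex only when its tentative weight improves, which guarantees termination on graphs with cycles. The adjacency representation and function name are assumptions, and edge weights are assumed to be non-negative.

```python
import heapq
import math


def shortest_path_weights(adj, s):
    """Weight of the lightest path from s to every vertex.

    adj maps each vertex to a list of (neighbour, weight) pairs; every
    neighbour must itself be a key of adj, and weights must be non-negative.
    """
    W = {u: math.inf for u in adj}
    W[s] = 0.0
    queue = [(0.0, s)]
    while queue:
        dist, u = heapq.heappop(queue)  # plays the role of Extract_Min
        if dist > W[u]:
            continue  # stale queue entry; a lighter path to u was already found
        for v, w in adj[u]:
            if W[v] > W[u] + w:  # the Relax step
                W[v] = W[u] + w
                heapq.heappush(queue, (W[v], v))
    return W


adj = {
    "a": [("b", 1.0), ("c", 4.0)],
    "b": [("a", 1.0), ("c", 2.0)],
    "c": [("a", 4.0), ("b", 2.0)],
    "d": [],
}
print(shortest_path_weights(adj, "a"))
```

Vertices that cannot be reached from s keep the initial weight of infinity, matching the initialization loop of the listing.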



Algorithm 3 Kernel 2: Spectral Graph Analysis 
Input: G(V, E), ε and m 
Output: An array X 
Let n = |V| 
Create an array C of size n 
for all i, where i = 0 ... n - 1 do 
    C[i] = k, such that k = Σ_{j=0}^{n-1} Weight(i, j) 
end for 
for all j, where j = 0 ... n - 1 do 

1 n 

end for 

t ← 0 

^ oo 

while t < m and > e do 

A*+i ^ Multiplication(A*) 

i ^ j + 1 

t ← t + 1 

end while 

function Multiplication(A) 

return A such that A^ = z, where z — X)i=o '^i=o ^ ' "'^i ^^'^ {hJi^) G E 
end function 



Algorithm 4 Kernel 3: Vertex and Edge Accesses 
Input: G(V, E) and γ (reduction ratio) 
Output: G'(V', E'), where G' is a new coalesced graph 
n ← γ · |V| 
i ← 0 
V' ← V 
E' ← E 
while i < n do 
    Let u be vertex randomly selected from V' 
    S ← {u} ∪ Adjacent(u) 
    Let u' be a new vertex representing S 
    c ← Σ_{v ∈ S} Color[v] / |S| 
    Color[u'] ← c 
    for all v ∈ S do 
        for all e = (v1, v2, w) ∈ E' such that v = v1 do 
            Update e = (u', v2, w) 
        end for 
        for all e = (v1, v2, w) ∈ E' such that v = v2 do 
            Update e = (v1, u', w) 
        end for 
    end for 
    for all v ∈ S do 
        V' ← V' - {v} 
    end for 
    V' ← V' ∪ {u'} 
    i ← i + 1 
end while 
Restore G 
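
One iteration of the coalescing loop in Algorithm 4 can be sketched in Python as follows. The sketch is illustrative only: the data layout (a vertex set, an edge list of triples, and a color dictionary), the function name, and the decision to keep edges internal to the merged group as self-loops on the new vertex are assumptions, since the listing does not say how such edges are treated.

```python
def coalesce(vertices, edges, color, u, new_vertex):
    """Merge u and its neighbours into new_vertex; return the new graph.

    vertices: set of vertex ids; edges: list of (v1, v2, weight);
    color: dict vertex -> float. The inputs are left unmodified.
    """
    # S = {u} together with every vertex adjacent to u
    group = {u} | {v2 for (v1, v2, _w) in edges if v1 == u}
    # The new vertex takes the average color of the merged vertices.
    new_color = sum(color[v] for v in group) / len(group)
    # Redirect every edge endpoint that falls inside the group to new_vertex.
    new_edges = [
        (new_vertex if v1 in group else v1, new_vertex if v2 in group else v2, w)
        for (v1, v2, w) in edges
    ]
    new_vertices = (vertices - group) | {new_vertex}
    new_colors = {v: color[v] for v in vertices - group}
    new_colors[new_vertex] = new_color
    return new_vertices, new_edges, new_colors


# Path 0-1-2-3, each undirected edge stored as two directed edges.
vertices = {0, 1, 2, 3}
edges = [(0, 1, 1.0), (1, 0, 1.0), (1, 2, 1.0), (2, 1, 1.0), (2, 3, 1.0), (3, 2, 1.0)]
color = {0: 0.2, 1: 0.4, 2: 0.6, 3: 0.8}
print(coalesce(vertices, edges, color, u=0, new_vertex=4))
```

Returning new containers instead of updating in place keeps the original graph available, which corresponds to the final "Restore G" step of the listing.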



Algorithm 5 Kernel 4: Graph Metric 
Input: G(V, E) and m 
Output: An array CC that contains the clustering coefficients of m vertices 
Create array CC of size m 
i ← 0 
for all i < m do 
    Let u ∈ V be a vertex randomly selected from G 
    S ← Adjacent(u) 
    c ← 0 
    for all v ∈ S do 
        S' ← Adjacent(v) 
        for all u' ∈ S' do 
            if u' ∈ S then 
                c ← c + 1 
            end if 
        end for 
        CC[i] ← c / (|S| · (|S| - 1)) 
    end for 
    i ← i + 1 
end for 



Algorithm 6 Kernel 5: Global Optimization 
Input: G(V, E), α, and m 
Output: k, the number of clusters that maximizes objective function 
Let G'(V', E') be a graph, where V' = {v ∈ V | Color[v] > α} and E' = {e = (u, v, w) ∈ E | u, v ∈ V'} 
i ← 0 
k ← 1 
g_1 ← V' 
C_i = {g_1} 
while i < m do 
    g_min ← g_j such that Density(g_j) = min_{g ∈ C_i} Density(g) 
    g_new ← ∅ 
    for all v ∈ g_min do 
        g_new ← g_new ∪ {v} 
        if Objective(C_i ∪ g_min) > Objective(C_i ∪ {g_min - {v}} ∪ {g_new ∪ {v}}) then 
            g_min ← g_min - {v} 
            g_new ← g_new ∪ {v} 
        end if 
        if g_new = ∅ then 
            Stop 
        else 
            C_i ← C_i ∪ g_min ∪ g_new 
            k ← k + 1 
        end if 
    end for 
    i ← i + 1 
end while 

function Density(g) 
    S_i ← {e = (u, v, w) ∈ E' | u ∈ g and v ∈ g} 
    S_e ← {e = (u, v, w) ∈ E' | e ∉ S_i and (u ∈ g or v ∈ g)} 
    return |S_i| / |S_e| 
end function 

function Objective(C) 
    return Σ_{g ∈ C} H(Density(g)) 
end function 

function H(p) 
    return -p · log2(p) - (1 - p) · log2(1 - p) 
end function 



